Abstract: A recently developed condition-based maintenance model is described which
utilises  reliability  data  combined  with  condition  monitoring  measurements  to  predict  the 
remaining  useful  life  of  critical  components  in  a  steelworks  hot  strip  mill.  The  results 
obtained  from  several  case  studies  are  presented  which  will  show  how  the  model  can  be 
used  as  part  of  a  condition-based  maintenance  strategy. 
Introduction: In a highly competitive industry, steelworks management has to
focus continually on improving product performance, quality and efficiency in
order to maintain a fair share of the available market and improve its customer base. In an
integrated steelworks complex, the hot strip mill is a crucial area of operation in
which unscheduled failure or breakdown of machinery can severely increase production
downtime and carries an associated risk of reduced finished goods quality.

For  several  years,  steel  companies  in  the  UK  have  practised  condition-based  maintenance 
in  strategically  vital  areas  such  as  the  hot  strip  mill.  The  methods  of  monitoring  utilised 
cover  virtually  the  whole  spectrum  of  activity;  these  include  vibration  analysis,  oil  and 
wear  debris  analysis,  and  performance  monitoring  using  numerous  techniques  to  measure, 
e.g.,  motor  current,  temperature,  etc. 

The  present  utilisation  of  these  methods  enables  plant  maintenance  personnel  to  detect  and 
also,  very  often,  diagnose  pending  failure  of  equipment.  What  they  are  unable  to  do  with 
much  certainty  is  to  predict  the  remaining  useful  life  of  failing  components. 

The  predictive  model  described  in  this  paper  has  been  developed  on  the  assumption  that 
the  failure  pattern  can  be  divided  into  two  distinct  phases:  stable  and  unstable,  which  can 
be  distinguished  by  using  statistical  process  control  methods.  Depending  on  the  way  in 
which  the  machinery  progresses  to  failure,  one  of  two  methods  is  employed  to  predict  the 
remaining  machine  life.  The  first  is  based  entirely  on  a  reliability  model,  while  the  second 
method  uses  a  novel  combination  of  reliability  and  condition  monitoring  measurements  to 
narrow  down  the  time  to  failure  ‘window’. 
After describing the methodology used to generate the predictive model, the results of
several  case  studies  will  be  presented  which  will  serve  to  show  how  it  can  be  utilised  as 
part  of  a  condition-based  maintenance  strategy  on  the  hot  strip  mill. 

Development  of  a  prediction  model  theory: 
a)  Some  basic  aspects 

To identify that a potential failure problem exists in the hot strip mill,
normal alarm limits are used, the levels of which are periodically adjusted on the basis of
factors such as operational experience, machine supplier recommendations, previous
failure data, or national/international standards. The problem with relying on
these methods is that if the alarm limits are set too high, the machine may fail without
sufficient advance warning. If the limits are set too low, the machine will generate false
alarms that can obscure a true warning until it is too late. Experienced machine operators
and maintenance personnel learn how to distinguish between false and
true alarms. However, trying to mimic such experience through the development of a
model to predict failure is beset by a number of difficulties.

Some of these difficulties may be addressed by employing a statistical process control
(SPC) approach, which distinguishes stable from unstable regions of the data
by setting suitable alarm limits. This requires the observation of at least 30 data
points, from which limits can be derived that are expected to reduce the number of false alarms. However,
this is machine and process dependent and needs to be carried out for each individual
situation. It is, therefore, a time-consuming activity, which may be alleviated to some
extent by pooling measurements for a group of otherwise similar machines, thereby
providing a larger population of failures from which to extract data and establish realistic
alarm limits.

Cumulative  Summation  (CuSum)  is  a  well  known,  sensitive  method  used  to  identify  small 
changes  in  the  average  value  of  a  data  set.  It  is  also  a  useful  technique  for  smoothing  data 
and  enhancing  any  fundamental  changes  occurring  in  the  process  characteristics.  Similar  to 
SPC  control  limits,  a  ‘v-mask’  can  be  constructed  using  the  CuSum  data  to  identify  when 
a  process  is  getting  ‘out  of  control’. 
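
As a minimal sketch of how a cumulative sum can flag a small upward shift in a condition measurement, the following uses the tabular (one-sided upper) CuSum rather than the graphical V-mask mentioned above. The function name, the slack value `k` and the decision interval `h` (both in units of the baseline standard deviation) are assumptions chosen for illustration, as are the sample readings.

```python
from statistics import mean, stdev

def cusum_alarm(readings, baseline, k=0.5, h=5.0):
    """Return the index of the first reading at which the upper CuSum
    exceeds h standard deviations, or None if it never does."""
    mu = mean(baseline)        # average level of the stable data
    sigma = stdev(baseline)    # sample standard deviation of the stable data
    upper = 0.0
    for i, x in enumerate(readings):
        # Accumulate only deviations above the mean plus the slack k*sigma.
        upper = max(0.0, upper + (x - mu) - k * sigma)
        if upper > h * sigma:
            return i
    return None

# Hypothetical vibration levels in mm/s, invented for illustration.
baseline = [5.8, 6.1, 6.0, 5.9, 6.2, 6.0, 5.7, 6.3, 6.0, 6.0]
drifting = [6.0, 6.2, 6.4, 6.6, 6.8, 7.0]
print(cusum_alarm(baseline, baseline), cusum_alarm(drifting, baseline))
```

Because the sum accumulates small persistent deviations, a slow drift can raise an alarm before any single reading looks unusual on its own.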

In predicting the remaining useful life of a machine, previously developed models have
based their approach on the extensive use of reliability data coupled with a number of
simplifying models [1][2][3]. However, although their predictions are seldom precise enough
to be useful for estimating the remaining life of individual machines, they have been found
to be useful for optimising maintenance strategies.

A commonly encountered reliability model applied to repairable systems is the Renewal
Process. It assumes: i) that when a machine fails, it is repaired perfectly, i.e., ‘as good as
new’, and ii) that times between failures are independent and identically distributed. When
these assumptions hold true, the process is said to be stationary and a reliability model can
be easily constructed. A special case of the Renewal Process arises when inter-arrival times are
independently and exponentially distributed with a constant failure rate. This is known as
the Homogeneous Poisson Process, in which the number of failures occurring in any given
interval follows a Poisson distribution.
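
To make the Poisson property concrete, the sketch below evaluates the standard Poisson probability of observing exactly n failures in a given period under a constant failure rate. The function name and the numerical values are hypothetical and are not taken from the mill data.

```python
import math

def poisson_failure_probability(n, failure_rate, duration):
    """P(exactly n failures in `duration`) for a constant failure rate."""
    expected = failure_rate * duration   # mean number of failures
    return expected ** n * math.exp(-expected) / math.factorial(n)

# Hypothetical figures: 0.02 failures per day over a 100-day window.
for n in range(4):
    print(n, round(poisson_failure_probability(n, 0.02, 100), 4))
```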

However, in practice, the time to failure is generally a function of many variables,
including design, operating conditions, environment, quality of repairs, etc. It follows,
therefore, that failure times are neither independent nor identically distributed and hence
the Renewal Process is very limited in its scope for application in this area.

All the above models use a single distribution function for all the times to failure over the
entire life of the system. This is unrealistic in practice, since changes due to deterioration or
improvement cannot be modelled by a single distribution function [4]. Hence, a stochastic
point process for a repairable system would seem to be a more appropriate approach to
adopt.

Christer  and  Waller  [1]  present  a  general  methodology  for  modelling  planned  maintenance 
in  which  they  introduce  the  concept  of  delay  time  analysis  to  model  failure  detection  such 
that  the  time  period  is  estimated  from  when  a  fault  is  detected  to  the  point  of  ultimate 
failure.  However,  they  found  that  reliability  data  was  unsuited  to  this  approach  and  a 
questionnaire  was  used  in  conjunction  with  human  (i.e.,  the  manager’s)  judgement  in  order 
to  successfully  optimise  the  preventive  maintenance  system  at  that  particular  location. 

Weibull analysis has proved to be a powerful tool in establishing reliability-based
diagnosis,  from  which  distinctions  can  be  drawn  between  infant  mortality,  random  failure 
and  wear-out  conditions  of  machinery.  Using  reliability  data  to  predict  the  performance  of 
the  machine  generally  involves  assuming  that  the  historical  performance  will  reflect  the 
current  performance.  The  latter  is  best  measured  by  strategic  use  of  machinery  health 
monitoring  techniques.  Therefore,  the  best  way  to  utilise  this  information  to  predict 
failures  is  by  intelligent  use  of  predetermined  alarm  limits.  An  example  of  the  kind  of 
approach  that  can  be  adopted  comes  from  the  aerospace  industry  where  so  much  of  the 
leading  edge  maintenance  technology  has  been  developed  in  recent  years.  Initially,  a 
deterministic  model  was  derived,  but  it  proved  to  be  ineffective  for  indicating  whether  an 
engine  should  be  overhauled  or  left  in  service.  Sarma  et  al  [5]  derived  a  stochastic  model 
which  incorporated  sample  and  noise  measurement.  This  new  model  proved  more 
successful  in  identifying  problems  and  formed  the  basis  of  a  decision  process  that  indicated 
whether an engine should remain in operation. More recently, Pulkkien [6] developed a
mathematical  model  of  wear  prediction  in  conjunction  with  monitoring  the  condition  of  a 
single  component. 

A  proportional  hazard  model  has  been  developed,  by  Knapp  and  Wang  [7],  which  predicts 
the  remaining  time  to  failure  of  a  machine.  It  uses  a  baseline  hazard  rate,  stated  as  a 
function  of  time,  and  a  hazard  function  based  on  the  machine  condition  variables.  Upon 
determining  the  hazard  rate,  the  reliability  of  the  machine  is  estimated  over  the  subsequent 
time  period  from  the  current  sample  point. 
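
The general idea of a proportional hazard model can be sketched as follows. This is the generic form, in which a baseline hazard is multiplied by an exponential function of the condition variables, and is not claimed to reproduce the specific formulation of [7]. The Weibull baseline, the parameter names (`eta`, `beta`, `gamma`) and all numerical values are assumptions, and the condition variables are taken to stay constant over the prediction interval.

```python
import math

def conditional_reliability(t_now, horizon, z, gamma, eta, beta):
    """Probability of surviving from t_now to t_now + horizon, given survival
    to t_now, with hazard h(t | z) = h0(t) * exp(sum(gamma_i * z_i)) and a
    Weibull baseline whose cumulative hazard is (t / eta) ** beta."""
    def cumulative_baseline(t):
        return (t / eta) ** beta
    risk = math.exp(sum(g * zi for g, zi in zip(gamma, z)))
    increment = cumulative_baseline(t_now + horizon) - cumulative_baseline(t_now)
    return math.exp(-risk * increment)

# Hypothetical case: machine age 100 days, 30-day look-ahead, one condition
# variable expressed as a deviation from its normal level.
print(conditional_reliability(100, 30, z=[0.0], gamma=[0.5], eta=300, beta=2))
print(conditional_reliability(100, 30, z=[2.0], gamma=[0.5], eta=300, beta=2))
```

With the condition variable at its normal level the result reduces to the baseline Weibull reliability, and a worsening condition lowers the predicted reliability over the same interval.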

From the above description of some relevant developments, it is evident that two models
are required: one to describe how the component deteriorates, the other to relate the
degree of deterioration to the measured ‘condition’ of the equipment being monitored.

b)  Prediction  model  theory 

The  first  part  of  the  model  relates  to  ensuring  the  earliest  identification  of  a  problem.  From 
analysis  of  numerous  failure  data  on  the  hot  strip  mill,  a  general  failure  pattern  becomes 
apparent which takes the form shown in Figure 1.


Figure  1:  General  Failure  Pattern 


In the ‘stable zone’, measurements simply vary about an average value. The
variation may be due to process changes between successive measurements and/or
measurement error. When the measurements start to deviate from these values, it becomes
apparent that a problem exists and the machine may have entered the ‘failure zone’.
Realistic alarm limits are set using SPC theory, such that when the condition
monitoring measurements move outside the limits imposed (normally set at three
standard deviations about the average), the condition is registered as being ‘unstable’ and
the operation has entered the designated failure zone.
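
A minimal sketch of this limit-setting step is given below, assuming the stable-zone readings are available as a simple list. The function names and the sample readings are hypothetical. The three-standard-deviation width and the minimum of 30 observations follow the text.

```python
from statistics import mean, stdev

def spc_limits(stable_readings, n_sigma=3.0, min_points=30):
    """Return (centre, lower, upper) alarm limits from stable-zone readings."""
    if len(stable_readings) < min_points:
        raise ValueError(f"need at least {min_points} stable readings")
    centre = mean(stable_readings)
    spread = stdev(stable_readings)   # sample standard deviation
    return centre, centre - n_sigma * spread, centre + n_sigma * spread

def classify(reading, lower, upper):
    """Label a reading 'stable' inside the limits and 'unstable' outside."""
    return "stable" if lower <= reading <= upper else "unstable"

# Hypothetical overall vibration levels in mm/s, invented for illustration.
readings = [5.5, 6.5] * 15
centre, lower, upper = spc_limits(readings)
print(classify(6.4, lower, upper), classify(9.5, lower, upper))
```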

‘Remaining  life’  models  based  solely  on  reliability  theory  are  related  to  time-based 
estimates  from  a  new  (or  repaired)  machine  condition.  If  the  measurement  of  a  machine’s 
condition  is  now  included,  the  overall  failure  time  is  estimated  in  terms  of  detection 
(reliability)  and  failure  prediction  (reliability  plus  condition  monitoring). 

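In quantifiable terms, by using a Weibull distribution function, we obtain:

For the stable zone:

$$TTF = c_1\left[-\ln\{1 - F(t)\}\right]^{1/k_1} - t_1 \qquad (1)$$

For the failure zone:

$$TTF = c_2\left[-\ln\{1 - F(t)\}\right]^{1/k_2} - t_2 \qquad (2)$$

Here the subscripts 1 and 2 identify the constants fitted to the stable and failure zones respectively.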

Each  zone  is  defined  in  terms  of  whether  the  condition  monitoring  measurement  is  inside 
or  outside  the  alarm  limits. 
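
The Weibull relation used here can be illustrated by inverting the two-parameter Weibull cumulative distribution, F(t) = 1 - exp[-(t/scale)^shape], to obtain the time associated with a chosen cumulative failure probability. The sketch below is a simplified illustration under that assumption. It omits any time offset, and the parameter values are hypothetical rather than fitted to the mill data.

```python
import math

def weibull_time(prob, scale, shape):
    """Time by which a fraction `prob` of units is expected to have failed."""
    if not 0.0 < prob < 1.0:
        raise ValueError("prob must lie strictly between 0 and 1")
    return scale * (-math.log(1.0 - prob)) ** (1.0 / shape)

# Hypothetical parameters: scale in days, dimensionless shape.
for prob in (0.1, 0.5, 0.9):
    print(prob, round(weibull_time(prob, scale=280.0, shape=3.5), 1))
```

Sweeping the probability from low to high values traces out the spread of plausible failure times, which is why a prediction based on reliability data alone gives a wide distribution.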

On this basis, it is evident that the ‘condition’ data acts as a ‘switch’ or ‘go/no go’ signal in
moving from Equation 1 to Equation 2. However, in order to make further use of the
condition data, a model of the failure zone pattern is also introduced. This is depicted in
Figure 2.


The failure condition commences at the lower limit (LL), which is the average
condition value within the stable zone. The condition measurement, X(t), increases until
it is detected passing through the alarm limit (AL). Subsequently, at some time t = tf, the
upper limit (UL) is reached and the machine needs to be inspected or withdrawn from
service.

Inspection  of  actual  failure  case  histories  revealed  that  the  failure  pattern  could  be 
approximated  to  an  exponential  curve.  While  this  behaviour  cannot  be  said  to  apply  to 
every  situation,  it  nevertheless  serves  as  an  initial  starting  point  for  developing  the 
prediction model. Later, a wider spectrum of failure patterns will be introduced and the
model will be adjusted accordingly.

Proceeding on this basis, the failure zone is expressed as:

$$X(t) = LL + (AL - LL)\exp\left[\frac{t}{t_f}\ln\left(\frac{UL - LL}{AL - LL}\right)\right] \qquad (3)$$

Values  for  LL  and  AL  are  obtained  from  the  SPC  modelling  of  the  stable  zone.  The 
estimate  of  UL  is  more  problematical  since  it  is  the  maximum  possible  level  the  machine  is 
permitted  to  reach  before  actual  failure  occurs.  UL  must,  therefore,  be  estimated  using 
appropriate  information  available  either  from  within  the  company,  or  from  outside 
sources, such as equipment suppliers, or by reference to universal standards. The time ‘tf’
is obtained by reference to reliability analysis of previous ‘failures’ and is, therefore,
obtained directly from Equation 2. By rearranging Equation 3, an expression for ‘t’ is
obtained with respect to the measured condition of the machine.
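
The rearrangement can be written out as follows, assuming the failure zone curve has the exponential form X(t) = LL + (AL - LL) exp[(t/t_f) ln((UL - LL)/(AL - LL))], with t measured from the moment the alarm limit is crossed. This is a sketch of the algebra under that assumption.

```latex
\begin{align*}
\frac{X - LL}{AL - LL} &= \exp\left[\frac{t}{t_f}\ln\left(\frac{UL - LL}{AL - LL}\right)\right] \\
\ln\left(\frac{X - LL}{AL - LL}\right) &= \frac{t}{t_f}\ln\left(\frac{UL - LL}{AL - LL}\right) \\
t &= t_f\,\frac{\ln\{(X - LL)/(AL - LL)\}}{\ln\{(UL - LL)/(AL - LL)\}}
\end{align*}
```

As a check, X = AL gives t = 0 and X = UL gives t = t_f, in agreement with the description of the failure zone.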

Hence,  the  remaining  life,  after  entering  the  failure  zone,  is 


$$TTF = t_f - t \qquad (4)$$


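By further substituting the values of ‘t’ and ‘tf’, we obtain:

$$TTF = \left(c_2\left[-\ln\{1 - F(t)\}\right]^{1/k_2} - t_2\right)\left[1 - \frac{\ln\{(X - LL)/(AL - LL)\}}{\ln\{(UL - LL)/(AL - LL)\}}\right] \qquad (5)$$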


To summarise: in order to predict the remaining life of the machine, Equation 1 is used
while the condition monitoring measurements lie within the pre-set alarm limits, i.e., in the
stable zone. When the condition monitoring measurements indicate that a problem has
occurred, i.e., the machine has entered the failure zone, Equation 5 is utilised, in which the time to failure is
predicted using a combination of reliability and condition monitoring measurements.
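
The switching logic summarised above can be sketched as a single function. This is an illustration under stated assumptions rather than the program used in the study. The reliability-based estimates for the two zones (`stable_estimate` and `t_f`) are assumed to have been computed beforehand, the failure zone is assumed to follow the exponential curve, and all names are hypothetical.

```python
import math

def remaining_life(x, stable_estimate, t_f, ll, al, ul):
    """Predicted remaining life for the latest condition measurement x.

    stable_estimate: reliability-only estimate used inside the alarm limit
    t_f: reliability-based time from alarm limit (al) to upper limit (ul)
    ll:  average condition level in the stable zone
    """
    if not ll < al < ul:
        raise ValueError("limits must satisfy ll < al < ul")
    if x <= al:
        # Stable zone: the condition data only acts as a go/no-go switch.
        return stable_estimate
    x = min(x, ul)  # at or beyond the upper limit no life remains
    elapsed_fraction = math.log((x - ll) / (al - ll)) / math.log((ul - ll) / (al - ll))
    # Failure zone: scale the reliability estimate by the fraction still to run.
    return t_f * (1.0 - elapsed_fraction)
```

The closer the measurement is to the upper limit, the smaller the multiplier applied to the reliability-based estimate, which is what narrows the prediction as failure approaches.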

To illustrate the way in which the model is designed to function, a computer program
was written to simulate typical machine failure patterns of the type observed to occur
frequently in the hot strip mill. The simulated machine failure pattern comprised a stable
zone of, on average, 20 weeks duration, followed by a failure zone which also lasted an
average of 20 weeks. The effect on the prediction of varying these times was also assessed,
and Figures 3, 4 & 5 show the results obtained for three different conditions. In Figure 3,
an ideal failure pattern is demonstrated. In the stable zone, a wide distribution is obtained
which reflects the uncertainty that accompanies sole dependence on reliability data. In
the failure zone, the prediction rapidly becomes much narrower and more focused,
eventually identifying the failure time as being 40 weeks from the start with a very high
certainty, depicted by the increased ‘sharpness’ of the distribution peak. The nearer the
time approaches the actual failure time, the more certain the prediction becomes.


Figure  3:  Illustrating  an  ideal  failure  pattern  at  40  weeks 

In  Figure  4,  the  machine  experiences  a  much  shorter  failure  zone  of  10  weeks,  in  which  it 
is  evident  that  the  model  ‘tracks’  the  time  to  failure  (~  30  weeks)  as  the  later  condition 
monitoring  measurements  are  also  used. 


Figure 4: Illustrating a shorter failure zone, failure at 30 weeks

If the stable zone time also deviates from the ideal average time, a ‘step’ jump is observed
to occur in the prediction distributions, as is demonstrated in Figure 5, in which the stable
zone lasts 10 weeks longer than the average. Once again, in the failure zone, the model tracks
to the point of failure after 50 weeks.


Figure 5: Illustrating a longer stable zone time, failure at 50 weeks

Case  Studies:  A  number  of  hydraulic  pumps  located  on  the  hot  strip  mill,  and  subjected 
to  regular  condition  monitoring  using  vibration  analysis,  were  selected  for  an  initial  case 
study. 

The  present  condition  monitoring  methods  and  strategies  used  on  the  mill  are  generally 
very  effective  in  identifying  pumps  which  require  attention  before  a  catastrophic  failure 
occurs.  However,  no  method  currently  exists  for  predicting  the  remaining  useful  life  of  the 
pumps  while  they  are  still  in  operation.  The  condition  monitoring  data  used  in  this  study  is 
based  solely  on  the  measurement  of  overall  vibration  level. 

The machine group selected for the initial assessment comprised three double vane pumps,
each delivering 320 litres per minute of hydraulic fluid at 160 bar pressure. The pumps
are each driven by a 120 kW electric motor at 1485 rev/min. The system supplies the
hydraulic requirements of critical machinery, including the Reversing Rougher and the Vertical
and Horizontal Scale Breakers.

SPC analysis of the stable measurements resulted in an average condition measurement
value of 6 mm/s and an estimated alarm level of 9.5 mm/s. Subsequent Weibull analysis
revealed that the distribution of stable region times approximates to a normal distribution
with an average of 260 days. The general failure pattern approximated well to an exponential
curve and, as a result, the upper limit was set at 18 mm/s. The resulting time from first
detection to reaching the upper limit of 18 mm/s averaged 104 days. The
distributions for all three pumps are presented in Figures 6, 7 & 8. For Pump No. 1, the
condition monitoring measurements only provided sufficient warning to prevent
catastrophic failure, although the last measurement taken did correctly pinpoint the failure
time.
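
As a worked illustration using the pump values quoted above (LL = 6 mm/s, AL = 9.5 mm/s, UL = 18 mm/s), suppose a reading of X = 13 mm/s is taken in the failure zone. The reading is hypothetical, and the average of 104 days is used here as a single value for t_f, whereas the model itself draws t_f from the fitted Weibull distribution. The exponential failure curve is assumed.

```latex
\begin{align*}
t &= t_f\,\frac{\ln\{(X - LL)/(AL - LL)\}}{\ln\{(UL - LL)/(AL - LL)\}}
   = 104 \times \frac{\ln(7/3.5)}{\ln(12/3.5)}
   \approx 104 \times \frac{0.693}{1.232} \approx 58.5\ \text{days} \\
TTF &= t_f - t \approx 104 - 58.5 = 45.5\ \text{days}
\end{align*}
```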


Figure 6: Showing Pump 1 failure predictions (System 3 Hydraulics, Pump 1)

In  the  case  of  Pump  No. 2,  there  was  sufficient  data  available  to  provide  a  more  focused 
prediction  of  time  to  failure. 


Figure  7:  Showing  Pump  2  failure  predictions 


Pump No. 3 failed with a relatively clear failure pattern, which is reflected in an accurate
prediction of time to failure from an early stage.


Figure  8:  Showing  Pump  3  failure  predictions 


Regarding the choice of alarm limits, all three methods for detecting
a machine problem (normal alarm limits, SPC limits and CuSum) were initially used. In the
stable region, the measurements were well within the normal alarm limits, but when the
measurements came closer to these limits, quite a large number of false alarms were
observed, accounting for about 10% of the total number of measurements. By changing to the
limits set using SPC, the proportion reduced to 5%. Using CuSum analysis resulted in a
further improvement to only 2.5% false alarms. However, in the failure zone, the CuSum
method led to large variation in the measurements, which made it difficult to estimate
properly the average level and associated distribution. CuSum was, therefore, judged to be
best utilised in identifying when the machine problem first occurred. SPC analysis in this
zone maintained its lower level of false alarms, but at the expense of reduced reaction time
to failure. Using the normal limits still resulted in a higher percentage of false alarms,
but it indicated a failure slightly sooner than the SPC analysis.

Conclusions: Currently in industry, condition monitoring can identify when machine
problems are occurring and, given enough experience, pinpoint the exact cause.

However, it is more difficult to predict the remaining life of the machine once the problem
has been identified, and therefore to decide when to change or maintain the machine.

Current literature on remaining life prediction has focused solely on reliability-based or
mathematically complex models. There is clearly a need for a simple, systematic
prediction model readily applicable to the industrial situation.

This paper has attempted to introduce such a model. Condition monitoring measurements
have been divided into two regions: a stable zone and a failure zone. Whilst in the stable zone,
condition measurements are normal and hence a reliability-based model is utilised. When
condition measurements increase, indicating a potential problem, reliability and condition
monitoring information are combined to form the remaining machine life prediction.


A case study was carried out to test the model. Initial results were encouraging, with all
machine failures being predicted before they occurred. It was evident that the prediction
model was dependent on the quality and accuracy of the condition monitoring
measurements.

It  is  anticipated  the  model  will  be  applicable  to  most  condition  monitored  situations 
provided  that  the  failure  lead  time  is  sufficiently  long  and  the  condition  monitoring  reflects 
the  health  of  the  machine. 